1 Some History

Historically, three papers [7, 8, 9] are widely credited with establishing the techniques of Probabilistic
Latent Semantic Analysis, or PLSA for short. However, a variant of the model also appears in [11], and
all of these models were originally discussed in an earlier technical report [10]. In [2], the authors
extended the MLE-style estimation of PLSA to MAP-style estimation. A hierarchical extension was proposed
in [6]. In [4], the authors showed the equivalence between PLSA and another popular method, non-negative
matrix factorization. A higher-order version of this result was shown in [12]. The equivalence between
PLSA and LDA was shown in [5]. More recently, a new MAP estimation algorithm was proposed in [13].



2 A Modern View of PLSA 



In order to better understand the intuition behind the model, we need to make some assumptions. First,
we assume a topic $\phi_k$ is a distribution over a fixed-size vocabulary $V$. In the original PLSA model,
this distribution is not explicitly specified, but it takes the form of a Multinomial distribution. Thus,
$\phi_k$ is essentially a vector in which each element $\phi_{(k,w)}$ represents the probability that
term $w$ is chosen by topic $k$, namely:

$$ p(w \mid k) = \phi_{(k,w)} \tag{1} $$

and note that $\sum_w \phi_{(k,w)} = 1$. Secondly, we also assume that a document consists of multiple topics.
Therefore, there is a distribution $\theta_d$ over a fixed number of topics $T$ for each document $d$.
Similarly, the original PLSA model does not explicitly specify this distribution, but it is indeed a
Multinomial distribution in which each element $\theta_{(d,k)}$ of the vector $\theta_d$ represents the
probability that topic $k$ appears in document $d$, namely:

$$ P(k \mid d) = \theta_{(d,k)} \tag{2} $$

and also $\sum_k \theta_{(d,k)} = 1$. This is the prerequisite of the model.
PLSA can be considered a generative model, although strictly speaking it is not one [1]. Before we start,
one subtle issue needs to be pointed out: the difference between a term $w$ in the vocabulary $V$
and a token position $d_i$ in a document $d$. Terms in the vocabulary are distinct, meaning that all the
terms differ from each other. Token positions are the places where terms are realized. Therefore, a term
can appear multiple times in the same document $d$, at different token positions.

Imagine someone who wants to write a document. The writer needs to decide which term to choose for each
token position in document $d$. For the $i$-th position, the writer first decides which topic to write
about, according to the distribution $\theta_d$. In this step, the writer essentially rolls a $T$-sided
die, since $\theta_d$ is a Multinomial distribution. Once the decision is made, say topic $k$, the writer
then chooses a term according to the distribution $\phi_k$. Similarly, a $V$-sided die is rolled. This
two-step generation process is repeated for all token positions and for all documents in the dataset.

The generation process can be summarized as follows: 

• For each document d 

- For each token position i 

      Choose a topic $z \sim \mathrm{Multinomial}(\theta_d)$
      Choose a term $w \sim \mathrm{Multinomial}(\phi_z)$

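We can write the probability of a term $w$ appearing at token position $i$ in document $d$ as follows:

$$ P(w_{d_i} = w \mid \Phi, \theta_d) = \sum_{z=1}^{T} \phi_{(z,w)}\,\theta_{(d,z)} \tag{3} $$

and the joint likelihood of the whole dataset $W$ is:

$$
P(W \mid \Phi, \Theta)
= \prod_{d}^{D} \prod_{i}^{N_d} \sum_{z=1}^{T} \phi_{(z,w_{d_i})}\,\theta_{(d,z)}
= \prod_{d}^{D} \prod_{w}^{V} \Big( \sum_{z=1}^{T} \phi_{(z,w)}\,\theta_{(d,z)} \Big)^{n(d,w)}
\tag{4}
$$

where $n(d,w)$ is the number of times term $w$ appears in document $d$ and $N_d$ is the number of token
positions in document $d$.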
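The generative story and the likelihood in Equation 4 can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the original text: the array layout (`phi[z, w]`, `theta[d, z]`, `counts[d, w]` for $n(d,w)$), the function names and the toy sizes are all hypothetical choices.

```python
import numpy as np

def sample_corpus(phi, theta, doc_lengths, rng):
    """Sample a count matrix n(d, w) with the two-step process described above."""
    num_docs, num_topics = theta.shape
    vocab_size = phi.shape[1]
    counts = np.zeros((num_docs, vocab_size), dtype=int)
    for d in range(num_docs):
        for _ in range(doc_lengths[d]):
            z = rng.choice(num_topics, p=theta[d])   # roll the T-sided die
            w = rng.choice(vocab_size, p=phi[z])     # roll the V-sided die
            counts[d, w] += 1
    return counts

def log_likelihood(counts, phi, theta):
    """Log of Equation 4: sum_d sum_w n(d,w) log sum_z phi[z,w] theta[d,z]."""
    p_w_given_d = theta @ phi                # shape (D, V)
    seen = counts > 0                        # skip terms with n(d, w) = 0
    return float(np.sum(counts[seen] * np.log(p_w_given_d[seen])))

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(6), size=2)      # toy setting: T = 2 topics, V = 6 terms
theta = rng.dirichlet(np.ones(2), size=3)    # toy setting: D = 3 documents
counts = sample_corpus(phi, theta, [20, 30, 40], rng)
ll = log_likelihood(counts, phi, theta)
```

Working with the count matrix rather than individual token positions uses the second form of Equation 4, which is why only $n(d,w)$ is needed to evaluate the likelihood.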

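In the formalism above, the likelihood depends on the parameters $\Phi$ and $\Theta$, which need to be
estimated from data. Here, we wish to obtain the parameters that maximize the above likelihood.
Therefore, we have:

$$
\arg\max_{\Phi,\Theta} \; \log P(W \mid \Phi, \Theta)
+ \sum_{d}^{D} \lambda_d \Big( 1 - \sum_{z}^{T} \theta_{(d,z)} \Big)
+ \sum_{z}^{T} \sigma_z \Big( 1 - \sum_{w}^{V} \phi_{(z,w)} \Big)
\tag{5}
$$

where the second and third terms are Lagrange multiplier terms that enforce the normalization constraints
$\sum_z \theta_{(d,z)} = 1$ and $\sum_w \phi_{(z,w)} = 1$, which the Multinomial parameters must satisfy
in addition to lying in the range [0,1].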



It is difficult to optimize the above objective directly because the logarithm sits outside a summation.
The EM (Expectation Maximization) algorithm [3] is employed here to estimate these parameters. The key
assumption in applying EM is that we know, for each token position, which topic the term was chosen from.
In other words, for each token position, we know the value of $z$. Note that we only pretend to know these
values. We use $R_{w_{d_i}}$ to represent which $z$ is chosen for token position $d_i$ in document $d$.
Thus, $R_{w_{d_i}}$ is a $T$-dimensional vector with $\sum_k R_{(w_{d_i},k)} = 1$. This also indicates that
each $R_{w_{d_i}}$ is in fact a valid distribution, and $R$ is a matrix in which each row is one
$R_{w_{d_i}}$. We plug all these hidden variables into the likelihood function:

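$$
\mathcal{L} = \log P(W \mid R, \Phi, \Theta)
= \sum_{d}^{D} \sum_{d_i}^{N_d} \sum_{z}^{T} R_{(w_{d_i},z)}
\Big( \log \phi_{(z,w_{d_i})} + \log \theta_{(d,z)} \Big)
\tag{6}
$$

and our new objective function is as follows:

$$
\arg\max_{\Phi,\Theta} \; \Lambda = \log P(W \mid R, \Phi, \Theta)
+ \sum_{d}^{D} \lambda_d \Big( 1 - \sum_{z}^{T} \theta_{(d,z)} \Big)
+ \sum_{z}^{T} \sigma_z \Big( 1 - \sum_{w}^{V} \phi_{(z,w)} \Big)
\tag{7}
$$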

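For a standard E-step in the EM algorithm, we compute the posterior distribution of the hidden variables,
given the data and the current values of the parameters:

$$
\begin{aligned}
\langle R_{(w_{d_i},k)} \rangle
&= P(R_{(w_{d_i},k)} = 1 \mid W, \Phi, \Theta) \\
&= \frac{P(w_{d_i}, R_{(w_{d_i},k)} = 1 \mid \Phi, \Theta)}
        {\sum_{k'} P(w_{d_i}, R_{(w_{d_i},k')} = 1 \mid \Phi, \Theta)} \\
&= \frac{P(w_{d_i} \mid \phi_{(k,w_{d_i})})\, P(k \mid \theta_d)}
        {\sum_{k'} P(w_{d_i} \mid \phi_{(k',w_{d_i})})\, P(k' \mid \theta_d)} \\
&= \frac{\phi_{(k,w_{d_i})}\,\theta_{(d,k)}}
        {\sum_{k'} \phi_{(k',w_{d_i})}\,\theta_{(d,k')}}
\end{aligned}
\tag{8}
$$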

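In the M-step, we obtain the new optimal values for the parameters given the current settings of the
hidden variables. For $\theta_d$, we have:

$$
\frac{\partial \Lambda}{\partial \theta_{(d,z)}}
= \sum_{d_i}^{N_d} \frac{\langle R_{(w_{d_i},z)} \rangle}{\theta_{(d,z)}} - \lambda_d = 0,
\qquad
\frac{\partial \Lambda}{\partial \lambda_d} = 1 - \sum_{z}^{T} \theta_{(d,z)} = 0
\tag{9}
$$

Solving the above two equations, we obtain:

$$
\theta_{(d,z)} = \frac{\sum_{d_i}^{N_d} \langle R_{(w_{d_i},z)} \rangle}{N_d}
\tag{10}
$$

Similarly, for $\phi_z$, we have:

$$
\frac{\partial \Lambda}{\partial \phi_{(z,w)}}
= \sum_{d}^{D} \sum_{d_i}^{N_d} \frac{\langle R_{(w_{d_i},z)} \rangle\, I(w_{d_i} = w)}{\phi_{(z,w)}} - \sigma_z = 0,
\qquad
\frac{\partial \Lambda}{\partial \sigma_z} = 1 - \sum_{w}^{V} \phi_{(z,w)} = 0
\tag{11}
$$

where $I(\cdot)$ is the indicator function. Solving the above two equations, we obtain:

$$
\phi_{(z,w)} = \frac{\sum_{d}^{D} \sum_{d_i}^{N_d} \langle R_{(w_{d_i},z)} \rangle\, I(w_{d_i} = w)}
                    {\sum_{w'}^{V} \sum_{d}^{D} \sum_{d_i}^{N_d} \langle R_{(w_{d_i},z)} \rangle\, I(w_{d_i} = w')}
\tag{12}
$$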

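Note that we can simplify the notation of the EM steps. For all token positions of the same term $w$ in
the same document $d$, the E-step is essentially the same, and therefore the simplified E-step is:

$$
\langle R^{(d)}_{(w,k)} \rangle = \frac{\phi_{(k,w)}\,\theta_{(d,k)}}{\sum_{k'}^{T} \phi_{(k',w)}\,\theta_{(d,k')}}
\tag{13}
$$

and the simplified M-step is:

$$
\theta_{(d,k)} = \frac{\sum_{w}^{V} n(d,w)\, \langle R^{(d)}_{(w,k)} \rangle}{N_d},
\qquad
\phi_{(k,w)} = \frac{\sum_{d}^{D} n(d,w)\, \langle R^{(d)}_{(w,k)} \rangle}
                    {\sum_{w'}^{V} \sum_{d}^{D} n(d,w')\, \langle R^{(d)}_{(w',k)} \rangle}
\tag{14}
$$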
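The simplified E-step and M-step translate almost directly into array operations. The following is a minimal sketch, assuming a dense count matrix `counts[d, w]` for $n(d,w)$ in which every document and every term occurs at least once. The function name, the random Dirichlet initialization, the iteration count and the tiny clamp on the denominator are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def plsa_em(counts, num_topics, num_iters=50, seed=0):
    """EM for PLSA on a (D, V) count matrix n(d, w), following Equations 13 and 14."""
    rng = np.random.default_rng(seed)
    num_docs, vocab_size = counts.shape
    phi = rng.dirichlet(np.ones(vocab_size), size=num_topics)    # phi[k, w]
    theta = rng.dirichlet(np.ones(num_topics), size=num_docs)    # theta[d, k]
    log_liks = []
    for _ in range(num_iters):
        # E-step (Eq. 13): resp[d, w, k] = phi[k, w] theta[d, k] / sum over k'
        joint = theta[:, None, :] * phi.T[None, :, :]            # shape (D, V, T)
        denom = np.maximum(joint.sum(axis=2, keepdims=True), 1e-300)
        resp = joint / denom
        # Log-likelihood (log of Eq. 4) under the parameters used in this E-step
        log_liks.append(float(np.sum(counts * np.log(denom[:, :, 0]))))
        # M-step (Eq. 14): re-estimate theta and phi from expected counts
        expected = counts[:, :, None] * resp                     # n(d, w) <R>
        theta = expected.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)                # same as dividing by N_d
        phi = expected.sum(axis=0).T
        phi /= phi.sum(axis=1, keepdims=True)
    return phi, theta, log_liks

# Toy corpus: 4 documents over 4 terms (hypothetical data)
counts = np.array([[5, 3, 0, 0],
                   [4, 6, 1, 0],
                   [0, 1, 7, 4],
                   [0, 0, 3, 8]])
phi, theta, log_liks = plsa_em(counts, num_topics=2, num_iters=30)
```

Recording the log-likelihood at every iteration is a cheap sanity check: EM should never decrease it, so a drop usually signals an implementation error.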



3 Discussion on EM Algorithm 



In the above discussion, there is one subtle detail that deserves more space. We introduced
$R_{(w_{d_i},k)}$ as indicator variables that indicate which topic is chosen for token position $d_i$.
Although it satisfies $\sum_k R_{(w_{d_i},k)} = 1$, this vector has exactly one element equal to 1.
However, in the E-step of the inference algorithm we calculate $\langle R_{(w_{d_i},k)} \rangle$, the
posterior distribution of the hidden variables given the data and the current settings of the parameters.
Here, $\langle R_{(w_{d_i},k)} \rangle$ is a distribution: each element of the vector holds a probability,
yet it still satisfies $\sum_k \langle R_{(w_{d_i},k)} \rangle = 1$. What really leads to this difference?

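We re-write the log likelihood of one token position after we introduce the indicator variables as follows:

$$
\log \sum_{k} R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big)
\tag{15}
$$

We introduce an auxiliary distribution $q(R_{(w_{d_i},k)}) = P(R_{(w_{d_i},k)} = 1)$, and therefore
$\sum_k q(R_{(w_{d_i},k)}) = 1$. Plugging this auxiliary distribution into the above log likelihood, we obtain:

$$
\log \sum_{k} q(R_{(w_{d_i},k)})\,
\frac{R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big)}{q(R_{(w_{d_i},k)})}
= \log E_q \left[ \frac{R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big)}{q(R_{(w_{d_i},k)})} \right]
\tag{16}
$$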



By using Jensen's Inequality, we can move the logarithm inside the expectation and obtain a lower bound of
our original log likelihood:

$$
\begin{aligned}
\log E_q \left[ \frac{R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big)}{q(R_{(w_{d_i},k)})} \right]
&\ge E_q \left[ \log \frac{R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big)}{q(R_{(w_{d_i},k)})} \right] \\
&= E_q \Big[ \log \Big( R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big) \Big) \Big]
 - E_q \Big[ \log q(R_{(w_{d_i},k)}) \Big]
\end{aligned}
\tag{17}
$$

Now, our goal is clear. Since it is hard to optimize the left-hand side directly, we maximize the
lower bound on the right-hand side as much as possible:

$$
\sum_{k} q(R_{(w_{d_i},k)}) \log \Big( R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big) \Big)
- \sum_{k} q(R_{(w_{d_i},k)}) \log q(R_{(w_{d_i},k)})
+ \lambda \Big( 1 - \sum_{k} q(R_{(w_{d_i},k)}) \Big)
\tag{18}
$$

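Taking the derivative with respect to $q(R_{(w_{d_i},k)})$ and setting it to 0, we have:

$$
\log \Big( R_{(w_{d_i},k)} \big( \phi_{(k,w_{d_i})}\,\theta_{(d,k)} \big) \Big) - \log q(R_{(w_{d_i},k)}) - 1 - \lambda = 0
\tag{19}
$$

Solving this for $q$ with $R_{(w_{d_i},k)} = 1$, and using the constraint $\sum_k q(R_{(w_{d_i},k)}) = 1$ to
eliminate $\lambda$, we obtain:

$$
q(R_{(w_{d_i},k)}) = \frac{\phi_{(k,w_{d_i})}\,\theta_{(d,k)}}{\sum_{k'} \phi_{(k',w_{d_i})}\,\theta_{(d,k')}}
\tag{20}
$$

This is exactly the E-step we obtained in the previous section. Note that $q(R_{(w_{d_i},k)})$ is indeed
$\langle R_{(w_{d_i},k)} \rangle$, and we see that the EM algorithm here is a lower-bound maximization process.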



4 Original Formalism of PLSA 

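In the PLSA originally proposed by Thomas Hofmann [7, 8, 9], there are two ways to formulate the model.
They are equivalent but may lead to different inference processes.

$$ P(d,w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \tag{21} $$

$$ P(d,w) = \sum_{z} P(w \mid z)\, P(d \mid z)\, P(z) \tag{22} $$

Let's see why these two equations are equivalent by using Bayes' rule:

$$
\begin{aligned}
P(z \mid d) &= \frac{P(d \mid z)\, P(z)}{P(d)} \\
P(z \mid d)\, P(d) &= P(d \mid z)\, P(z) \\
P(w \mid z)\, P(z \mid d)\, P(d) &= P(w \mid z)\, P(d \mid z)\, P(z) \\
P(d) \sum_{z} P(w \mid z)\, P(z \mid d) &= \sum_{z} P(w \mid z)\, P(d \mid z)\, P(z)
\end{aligned}
$$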

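The whole data set $\mathcal{D}$ is generated as (we assume that all words are generated independently):

$$ P(\mathcal{D}) = \prod_{d} \prod_{w} P(d,w)^{n(d,w)} \tag{23} $$

The log-likelihoods of the whole data set under Formulation 1 (Equation 21) and Formulation 2 (Equation 22) are:

$$ L_1 = \sum_{d} \sum_{w} n(d,w) \log \Big[ P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \Big] \tag{24} $$

$$ L_2 = \sum_{d} \sum_{w} n(d,w) \log \Big[ \sum_{z} P(w \mid z)\, P(d \mid z)\, P(z) \Big] \tag{25} $$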
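The equivalence of the two formulations can also be checked numerically. The sketch below, which assumes randomly drawn toy parameters and hypothetical array names, converts Formulation 1 parameters into Formulation 2 parameters with Bayes' rule and compares the resulting joint distributions $P(d,w)$.

```python
import numpy as np

rng = np.random.default_rng(1)
num_docs, vocab_size, num_topics = 4, 5, 3            # toy sizes (assumed)

# Formulation 1 parameters: P(d), P(z|d), P(w|z)
p_d = rng.dirichlet(np.ones(num_docs))                        # p_d[d]
p_z_given_d = rng.dirichlet(np.ones(num_topics), size=num_docs)    # [d, z]
p_w_given_z = rng.dirichlet(np.ones(vocab_size), size=num_topics)  # [z, w]

# Equation 21: P(d, w) = P(d) sum_z P(w|z) P(z|d)
joint_1 = p_d[:, None] * (p_z_given_d @ p_w_given_z)          # shape (D, V)

# Bayes' rule: P(z) = sum_d P(z|d) P(d) and P(d|z) = P(z|d) P(d) / P(z)
p_z = p_d @ p_z_given_d                                       # p_z[z]
p_d_given_z = (p_z_given_d * p_d[:, None]).T / p_z[:, None]   # [z, d]

# Equation 22: P(d, w) = sum_z P(w|z) P(d|z) P(z)
joint_2 = (p_d_given_z * p_z[:, None]).T @ p_w_given_z        # shape (D, V)

max_abs_diff = float(np.max(np.abs(joint_1 - joint_2)))
```

Because both joints agree up to floating-point error, the log-likelihoods $L_1$ and $L_2$ coincide for parameters related in this way; only the parameterization that EM updates differs.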
5 EM



For Equation 24 and Equation 25, the optimization is hard because of the logarithm of a sum. Therefore,
an algorithm called Expectation-Maximization is usually employed. Before we introduce anything about EM,
please note that EM is only guaranteed to find a local optimum, although it may happen to be a global one.



First, we see how EM works in general. As we have shown for PLSA, we usually want to maximize the
likelihood of the data, namely $P(X \mid \theta)$, with respect to the parameter $\theta$. The easiest way
is to obtain a maximum likelihood estimator by maximizing $P(X \mid \theta)$ directly. However, sometimes we
also want to include hidden variables that are useful for our task. Therefore, what we really want to
maximize is $P(X \mid \theta) = \sum_z P(X \mid z, \theta)\, P(z \mid \theta)$, the complete likelihood. Our
attention now turns to this complete likelihood. Again, directly maximizing it is usually difficult. What
we would like to do here is to obtain a lower bound of the likelihood and maximize this lower bound.



We need Jensen's Inequality to help us obtain this lower bound. For any convex function $f(x)$ and any
$\lambda \in [0,1]$, Jensen's Inequality states that:

$$ \lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y) \tag{26} $$

Thus, it is not difficult to show that:

$$ E[f(x)] = \sum_{x} p(x) f(x) \ge f\Big( \sum_{x} p(x)\, x \Big) = f(E[x]) \tag{27} $$

For concave functions (like the logarithm), Jensen's Inequality is reversed:

$$ E[f(x)] \le f(E[x]) \tag{28} $$
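Both directions of the inequality are easy to verify on a small discrete distribution. The snippet below is only a numerical illustration with arbitrarily chosen values: it uses $f(x) = x^2$ as the convex example and $f(x) = \log x$ as the concave one.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=8)      # positive support so that log(x) is defined
p = rng.dirichlet(np.ones(8))          # probabilities p(x), summing to 1

mean_x = float(np.sum(p * x))          # E[x]

# Convex case (Eq. 27) with f(x) = x^2: E[f(x)] - f(E[x]) >= 0
convex_gap = float(np.sum(p * x**2)) - mean_x**2

# Concave case (Eq. 28) with f(x) = log x: f(E[x]) - E[f(x)] >= 0
concave_gap = float(np.log(mean_x)) - float(np.sum(p * np.log(x)))
```

The concave gap is exactly the slack that EM exploits: the bound built from it becomes tight only for a particular choice of the auxiliary distribution.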



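Back to our complete likelihood, we can obtain the following conclusion by using the concave version of
Jensen's Inequality:

$$
\begin{aligned}
\log \sum_{z} P(X \mid z, \theta)\, P(z \mid \theta)
&= \log \sum_{z} P(X \mid z, \theta)\, P(z \mid \theta)\, \frac{q(z)}{q(z)} && (29) \\
&= \log E_q \left[ \frac{P(X \mid z, \theta)\, P(z \mid \theta)}{q(z)} \right] && (30) \\
&\ge E_q \left[ \log \frac{P(X \mid z, \theta)\, P(z \mid \theta)}{q(z)} \right] && (31)
\end{aligned}
$$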



where $E_q$ denotes the expectation with respect to $q(z)$. Therefore, we have obtained a lower bound of
the complete likelihood, and we want to make it as tight as possible. EM is an algorithm that maximizes
this lower bound iteratively. Usually, EM first fixes the current value of $\theta$ and maximizes the bound
over $q(z)$, and then uses the new $q(z)$ to obtain a new guess of $\theta$, which is essentially a
two-stage maximization process. The first step can be shown as follows:



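$$
\begin{aligned}
E_q \left[ \log \frac{P(X \mid z, \theta)\, P(z \mid \theta)}{q(z)} \right]
&= \sum_{z} q(z) \log \frac{P(X \mid z, \theta)\, P(z \mid \theta)}{q(z)} \\
&= \sum_{z} q(z) \log \frac{P(z \mid X, \theta)\, P(X \mid \theta)}{q(z)} \\
&= \sum_{z} q(z) \log P(X \mid \theta) + \sum_{z} q(z) \log \frac{P(z \mid X, \theta)}{q(z)} \\
&= \log P(X \mid \theta) - \sum_{z} q(z) \log \frac{q(z)}{P(z \mid X, \theta)} \\
&= \log P(X \mid \theta) - E_q \left[ \log \frac{q(z)}{P(z \mid X, \theta)} \right] \\
&= \log P(X \mid \theta) - KL\big( q(z) \,\|\, P(z \mid X, \theta) \big)
\end{aligned}
$$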



The first term does not depend on $q(z)$. Therefore, in order to maximize the whole expression, we need to
minimize the KL divergence between $q(z)$ and $P(z \mid X, \theta)$, which leads to the optimal solution
$q(z) = P(z \mid X, \theta)$. So, for the E-step, we usually use the current guess of $\theta$ to calculate
the posterior distribution of the hidden variables as the new $q(z)$. The M-step is problem-dependent. We
will see how to do that in later discussions.
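The decomposition of the lower bound into the log-likelihood minus a KL divergence can be confirmed numerically for a discrete hidden variable. In the sketch below, the vector `joint` holds hypothetical values of $P(X \mid z, \theta) P(z \mid \theta)$ for four values of $z$; the numbers and names are assumptions made purely for illustration.

```python
import numpy as np

def lower_bound(joint, q):
    """E_q[log(P(X|z,theta) P(z|theta) / q(z))] for a discrete hidden variable z."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

rng = np.random.default_rng(3)
joint = rng.uniform(0.01, 0.2, size=4)         # hypothetical P(X|z,theta) P(z|theta)
log_evidence = float(np.log(joint.sum()))      # log P(X|theta)
posterior = joint / joint.sum()                # P(z|X,theta)

q_arbitrary = rng.dirichlet(np.ones(4))        # some other distribution over z
kl = float(np.sum(q_arbitrary * np.log(q_arbitrary / posterior)))

bound_arbitrary = lower_bound(joint, q_arbitrary)   # equals log_evidence - kl
bound_posterior = lower_bound(joint, posterior)     # equals log_evidence (tight)
```

Choosing $q(z)$ equal to the posterior makes the KL term vanish, which is why the E-step sets $q$ to the posterior under the current parameters.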



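We also show another explanation of EM in terms of optimizing a so-called Q-function. We write the joint
distribution of the data $X$ and the hidden variables $H$ as
$P(X, H \mid \theta) = P(H \mid X, \theta)\, P(X \mid \theta)$. Therefore, the complete likelihood is:

$$
L_c(\theta) = \log P(X, H \mid \theta) = \log P(X \mid \theta) + \log P(H \mid X, \theta)
= L(\theta) + \log P(H \mid X, \theta)
$$

Think about how to maximize $L(\theta)$. Instead of maximizing it directly, we can iteratively improve it
by making the difference $L(\theta) - L(\theta^{(n)})$ between a new value $\theta$ and the current
estimate $\theta^{(n)}$ as large as possible:

$$
L(\theta) - L(\theta^{(n)}) = L_c(\theta) - \log P(H \mid X, \theta) - L_c(\theta^{(n)}) + \log P(H \mid X, \theta^{(n)})
$$

Now take the expectation of this equation with respect to $P(H \mid X, \theta^{(n)})$; we have:

$$
L(\theta) - L(\theta^{(n)})
= \sum_{H} L_c(\theta)\, P(H \mid X, \theta^{(n)})
- \sum_{H} L_c(\theta^{(n)})\, P(H \mid X, \theta^{(n)})
+ \sum_{H} P(H \mid X, \theta^{(n)}) \log \frac{P(H \mid X, \theta^{(n)})}{P(H \mid X, \theta)}
$$

The last term is always non-negative, since it is the KL divergence between $P(H \mid X, \theta^{(n)})$ and
$P(H \mid X, \theta)$. Therefore, we obtain a lower bound of the likelihood:

$$
L(\theta) \ge \sum_{H} L_c(\theta)\, P(H \mid X, \theta^{(n)}) + L(\theta^{(n)})
- \sum_{H} L_c(\theta^{(n)})\, P(H \mid X, \theta^{(n)})
$$

The last two terms can be treated as constants because they do not contain the variable $\theta$, so
maximizing the lower bound amounts to maximizing the first term, which is sometimes called the "Q-function":

$$
Q(\theta; \theta^{(n)}) = E[L_c(\theta)] = \sum_{H} L_c(\theta)\, P(H \mid X, \theta^{(n)})
\tag{32}
$$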
5.1 EM of Formulation 1



In the case of Formulation 1, let us introduce hidden variables $R(z,w,d)$ to indicate which hidden topic
$z$ is selected to generate $w$ in $d$, where $\sum_z R(z,w,d) = 1$. Therefore, the complete likelihood
can be formulated as:

$$
\begin{aligned}
L_{c1} &= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \big[ P(d)\, P(w \mid z)\, P(z \mid d) \big] \\
       &= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \big[ \log P(d) + \log P(w \mid z) + \log P(z \mid d) \big]
\end{aligned}
$$

From the equation above, we can write our Q-function for the complete likelihood, $E[L_{c1}]$:

$$
E[L_{c1}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z \mid w, d) \big[ \log P(d) + \log P(w \mid z) + \log P(z \mid d) \big]
$$

For the E-step, we obtain the posterior probability of the latent variables:

$$
\begin{aligned}
P(z \mid w, d) &= \frac{P(w, z, d)}{P(w, d)} \\
&= \frac{P(w \mid z)\, P(z \mid d)\, P(d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)\, P(d)} \\
&= \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}
\end{aligned}
$$

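For the M-step, we need to maximize the Q-function subject to the normalization constraints, with one
Lagrange multiplier per constraint:

$$
H = E[L_{c1}] + \alpha \Big[ 1 - \sum_{d} P(d) \Big]
+ \sum_{z} \beta_z \Big[ 1 - \sum_{w} P(w \mid z) \Big]
+ \sum_{d} \gamma_d \Big[ 1 - \sum_{z} P(z \mid d) \Big]
$$

where $\alpha$, $\beta_z$ and $\gamma_d$ are Lagrange multipliers. We take all derivatives:

$$
\begin{aligned}
\frac{\partial H}{\partial P(d)} &= \sum_{w} \sum_{z} n(d,w) \frac{P(z \mid w, d)}{P(d)} - \alpha = 0
&&\rightarrow \sum_{w} \sum_{z} n(d,w)\, P(z \mid w, d) - \alpha P(d) = 0 \\
\frac{\partial H}{\partial P(w \mid z)} &= \sum_{d} n(d,w) \frac{P(z \mid w, d)}{P(w \mid z)} - \beta_z = 0
&&\rightarrow \sum_{d} n(d,w)\, P(z \mid w, d) - \beta_z P(w \mid z) = 0 \\
\frac{\partial H}{\partial P(z \mid d)} &= \sum_{w} n(d,w) \frac{P(z \mid w, d)}{P(z \mid d)} - \gamma_d = 0
&&\rightarrow \sum_{w} n(d,w)\, P(z \mid w, d) - \gamma_d P(z \mid d) = 0
\end{aligned}
$$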
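Therefore, we can easily obtain:

$$
P(d) = \frac{\sum_{w} \sum_{z} n(d,w)\, P(z \mid w, d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w)\, P(z \mid w, d)}
     = \frac{n(d)}{\sum_{d} n(d)}
\tag{33}
$$

$$
P(w \mid z) = \frac{\sum_{d} n(d,w)\, P(z \mid w, d)}{\sum_{w} \sum_{d} n(d,w)\, P(z \mid w, d)}
\tag{34}
$$

$$
P(z \mid d) = \frac{\sum_{w} n(d,w)\, P(z \mid w, d)}{\sum_{z} \sum_{w} n(d,w)\, P(z \mid w, d)}
            = \frac{\sum_{w} n(d,w)\, P(z \mid w, d)}{n(d)}
\tag{35}
$$

where $n(d) = \sum_w n(d,w)$ is the length of document $d$.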



5.2 EM of Formulation 2 

Using a similar method, we introduce hidden variables $R(z,w,d)$ to indicate which $z$ is selected to
generate $w$ and $d$, and obtain the following complete likelihood:

$$
\begin{aligned}
L_{c2} &= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \log \big[ P(z)\, P(w \mid z)\, P(d \mid z) \big] \\
       &= \sum_{d} \sum_{w} n(d,w) \sum_{z} R(z,w,d) \big[ \log P(z) + \log P(w \mid z) + \log P(d \mid z) \big]
\end{aligned}
$$

Therefore, the Q-function $E[L_{c2}]$ is:

$$
E[L_{c2}] = \sum_{d} \sum_{w} n(d,w) \sum_{z} P(z \mid w, d) \big[ \log P(z) + \log P(w \mid z) + \log P(d \mid z) \big]
$$

For the E-step, again, we obtain the posterior probability of the latent variables:

$$
P(z \mid w, d) = \frac{P(w, z, d)}{P(w, d)}
= \frac{P(w \mid z)\, P(d \mid z)\, P(z)}{\sum_{z'} P(w \mid z')\, P(d \mid z')\, P(z')}
$$
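For the M-step, we maximize the constrained version of the Q-function, with one Lagrange multiplier per
normalization constraint:

$$
H = E[L_{c2}] + \alpha \Big[ 1 - \sum_{z} P(z) \Big]
+ \sum_{z} \beta_z \Big[ 1 - \sum_{w} P(w \mid z) \Big]
+ \sum_{z} \gamma_z \Big[ 1 - \sum_{d} P(d \mid z) \Big]
$$

where $\alpha$, $\beta_z$ and $\gamma_z$ are Lagrange multipliers. We take all derivatives:

$$
\begin{aligned}
\frac{\partial H}{\partial P(z)} &= \sum_{d} \sum_{w} n(d,w) \frac{P(z \mid w, d)}{P(z)} - \alpha = 0
&&\rightarrow \sum_{d} \sum_{w} n(d,w)\, P(z \mid w, d) - \alpha P(z) = 0 \\
\frac{\partial H}{\partial P(w \mid z)} &= \sum_{d} n(d,w) \frac{P(z \mid w, d)}{P(w \mid z)} - \beta_z = 0
&&\rightarrow \sum_{d} n(d,w)\, P(z \mid w, d) - \beta_z P(w \mid z) = 0 \\
\frac{\partial H}{\partial P(d \mid z)} &= \sum_{w} n(d,w) \frac{P(z \mid w, d)}{P(d \mid z)} - \gamma_z = 0
&&\rightarrow \sum_{w} n(d,w)\, P(z \mid w, d) - \gamma_z P(d \mid z) = 0
\end{aligned}
$$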



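Therefore, we can easily obtain:

$$
P(z) = \frac{\sum_{d} \sum_{w} n(d,w)\, P(z \mid w, d)}{\sum_{d} \sum_{w} \sum_{z} n(d,w)\, P(z \mid w, d)}
     = \frac{\sum_{d} \sum_{w} n(d,w)\, P(z \mid w, d)}{\sum_{d} \sum_{w} n(d,w)}
\tag{36}
$$

$$
P(w \mid z) = \frac{\sum_{d} n(d,w)\, P(z \mid w, d)}{\sum_{w} \sum_{d} n(d,w)\, P(z \mid w, d)}
\tag{37}
$$

$$
P(d \mid z) = \frac{\sum_{w} n(d,w)\, P(z \mid w, d)}{\sum_{d} \sum_{w} n(d,w)\, P(z \mid w, d)}
\tag{38}
$$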
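One EM iteration for Formulation 2 can be sketched as follows. This is a minimal illustration, assuming a dense count matrix `counts[d, w]` in which every document and every term occurs at least once; the array layouts (`p_w_given_z[z, w]`, `p_d_given_z[z, d]`), the function names and the toy data are hypothetical.

```python
import numpy as np

def em_step_formulation2(counts, p_z, p_w_given_z, p_d_given_z):
    """One EM iteration for P(d,w) = sum_z P(w|z) P(d|z) P(z)."""
    # E-step: post[d, w, z] is proportional to P(w|z) P(d|z) P(z)
    joint = (p_z[None, None, :]
             * p_w_given_z.T[None, :, :]       # (1, V, T)
             * p_d_given_z.T[:, None, :])      # (D, 1, T)
    post = joint / joint.sum(axis=2, keepdims=True)
    weighted = counts[:, :, None] * post       # n(d,w) P(z|w,d)
    mass = weighted.sum(axis=(0, 1))           # sum_d sum_w n(d,w) P(z|w,d), per topic
    # M-step
    new_p_z = mass / counts.sum()                               # Eq. 36
    new_p_w_given_z = weighted.sum(axis=0).T / mass[:, None]    # Eq. 37, shape (T, V)
    new_p_d_given_z = weighted.sum(axis=1).T / mass[:, None]    # Eq. 38, shape (T, D)
    return new_p_z, new_p_w_given_z, new_p_d_given_z

def log_likelihood2(counts, p_z, p_w_given_z, p_d_given_z):
    """Equation 25: sum_d sum_w n(d,w) log sum_z P(w|z) P(d|z) P(z)."""
    joint_dw = (p_d_given_z * p_z[:, None]).T @ p_w_given_z     # P(d, w), shape (D, V)
    return float(np.sum(counts * np.log(joint_dw)))

rng = np.random.default_rng(4)
counts = np.array([[5, 3, 1, 0],
                   [4, 6, 1, 1],
                   [0, 1, 7, 4],
                   [1, 0, 3, 8]])              # toy data (hypothetical)
p_z = rng.dirichlet(np.ones(2))
p_w_given_z = rng.dirichlet(np.ones(4), size=2)
p_d_given_z = rng.dirichlet(np.ones(4), size=2)

trace = []
for _ in range(20):
    trace.append(log_likelihood2(counts, p_z, p_w_given_z, p_d_given_z))
    p_z, p_w_given_z, p_d_given_z = em_step_formulation2(
        counts, p_z, p_w_given_z, p_d_given_z)
```

All three updates share the same per-topic normalizer, which is why the sketch computes `mass` once and reuses it.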



6 Incorporating Background Language Model 

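Another PLSA model, which incorporates a background language model, is usually formulated like this:

$$
P(d,w) = \lambda_B P(w \mid \theta_B) + (1 - \lambda_B) \sum_{z} P(w \mid z)\, P(z \mid d)\, P(d)
\tag{39}
$$

The log likelihood of Equation 39 is:

$$
L = \sum_{d} \sum_{w} n(d,w) \log \Big[ \lambda_B P(w \mid \theta_B) + (1 - \lambda_B) \sum_{z} P(w \mid z)\, P(z \mid d)\, P(d) \Big]
$$

Let's again introduce a hidden variable $Z_{d,w}$ to indicate which component the word $w$ in document $d$
is generated from: $P(Z_{d,w} = \theta_B)$ is the probability that the word is generated by the background
model, and $P(Z_{d,w} = z)$ is the probability that the word is generated by topic $z$. Thus, the complete
log likelihood is:

$$
L_c = \sum_{d} \sum_{w} n(d,w) \Big[ P(Z_{d,w} = \theta_B) \log \big( \lambda_B P(w \mid \theta_B) \big)
+ \big( 1 - P(Z_{d,w} = \theta_B) \big) \sum_{z} P(Z_{d,w} = z \mid Z_{d,w} \ne \theta_B)
\log \big( (1 - \lambda_B)\, P(w \mid z)\, P(z \mid d)\, P(d) \big) \Big]
$$

The factor $1 - P(Z_{d,w} = \theta_B)$ weights the topic terms by the probability that the word was not
generated by the background model.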
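The E-step is straightforward. Using Bayes' rule, we can obtain:

$$
P(Z_{d,w} = \theta_B \mid d, w)
= \frac{\lambda_B P(w \mid \theta_B)}{\lambda_B P(w \mid \theta_B) + (1 - \lambda_B) \sum_{z} P(w \mid z)\, P(z \mid d)\, P(d)}
$$

$$
P(Z_{d,w} = z \mid d, w)
= \frac{P(w \mid z)\, P(z \mid d)\, P(d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)\, P(d)}
= \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}
$$

Here the second posterior is conditioned on the word not being generated by the background model.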

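For the M-step, we maximize the constrained version of the Q-function, with one Lagrange multiplier per
normalization constraint:

$$
H = E[L_c] + \sum_{z} \beta_z \Big[ 1 - \sum_{w} P(w \mid z) \Big] + \sum_{d} \gamma_d \Big[ 1 - \sum_{z} P(z \mid d) \Big]
$$

and take all derivatives, writing $P(Z_{d,w} = z)$ for the topic posterior from the E-step:

$$
\begin{aligned}
\frac{\partial H}{\partial P(w \mid z)}
&= \sum_{d} n(d,w) \frac{\big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}{P(w \mid z)} - \beta_z = 0 \\
&\rightarrow \sum_{d} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z) - \beta_z P(w \mid z) = 0 \\
\frac{\partial H}{\partial P(z \mid d)}
&= \sum_{w} n(d,w) \frac{\big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}{P(z \mid d)} - \gamma_d = 0 \\
&\rightarrow \sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z) - \gamma_d P(z \mid d) = 0
\end{aligned}
$$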

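Therefore, we can easily obtain:

$$
P(w \mid z) = \frac{\sum_{d} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
                   {\sum_{w} \sum_{d} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
$$

$$
P(z \mid d) = \frac{\sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
                   {\sum_{z} \sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
$$

Note that $P(w \mid \theta_B)$ is estimated only once, from the whole collection, by using the equation:

$$
P(w \mid \theta_B) = \frac{\sum_{d} n(d,w)}{\sum_{w} \sum_{d} n(d,w)}
$$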
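The background-model updates can be sketched in the same style. The code below makes several assumptions that the text does not state: $\lambda_B$ is a fixed, user-chosen mixing weight, $P(d)$ is held fixed at $n(d) / \sum_d n(d)$, every document and term occurs at least once, and all function and variable names are hypothetical.

```python
import numpy as np

def background_terms(counts, p_w_given_z, p_z_given_d, lambda_b):
    """Background and topic parts of Equation 39 for every (d, w) pair."""
    p_d = counts.sum(axis=1) / counts.sum()          # assumption: P(d) fixed to n(d)/sum n(d)
    p_w_bg = counts.sum(axis=0) / counts.sum()       # P(w | theta_B), estimated once
    bg = lambda_b * p_w_bg[None, :] * np.ones_like(counts, dtype=float)
    fg = (1 - lambda_b) * (p_z_given_d @ p_w_given_z) * p_d[:, None]
    return bg, fg

def em_step_background(counts, p_w_given_z, p_z_given_d, lambda_b):
    """One EM iteration for Formulation 1 with a background language model."""
    bg, fg = background_terms(counts, p_w_given_z, p_z_given_d, lambda_b)
    post_bg = bg / (bg + fg)                         # P(Z = theta_B | d, w)
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]   # (D, V, T)
    post_z = joint / joint.sum(axis=2, keepdims=True)             # P(Z = z | d, w)
    # n(d,w) (1 - P(Z = theta_B | d,w)) P(Z = z | d,w)
    weighted = (counts * (1 - post_bg))[:, :, None] * post_z
    new_p_w_given_z = weighted.sum(axis=0).T
    new_p_w_given_z /= new_p_w_given_z.sum(axis=1, keepdims=True)
    new_p_z_given_d = weighted.sum(axis=1)
    new_p_z_given_d /= new_p_z_given_d.sum(axis=1, keepdims=True)
    return new_p_w_given_z, new_p_z_given_d, post_bg

def log_likelihood_bg(counts, p_w_given_z, p_z_given_d, lambda_b):
    bg, fg = background_terms(counts, p_w_given_z, p_z_given_d, lambda_b)
    return float(np.sum(counts * np.log(bg + fg)))

rng = np.random.default_rng(5)
counts = np.array([[5, 3, 1, 0],
                   [4, 6, 1, 1],
                   [0, 1, 7, 4],
                   [1, 0, 3, 8]])                    # toy data (hypothetical)
lambda_b = 0.3                                       # assumed fixed mixing weight
p_w_given_z = rng.dirichlet(np.ones(4), size=2)      # [z, w]
p_z_given_d = rng.dirichlet(np.ones(2), size=4)      # [d, z]

trace = []
for _ in range(20):
    trace.append(log_likelihood_bg(counts, p_w_given_z, p_z_given_d, lambda_b))
    p_w_given_z, p_z_given_d, post_bg = em_step_background(
        counts, p_w_given_z, p_z_given_d, lambda_b)
```

Words that the background model explains well receive a large `post_bg` and therefore contribute little to the topic updates, which is the intended effect of adding the background component.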

If we change to PLSA Formulation 2, we get the following E-steps:

$$
P(Z_{d,w} = \theta_B \mid d, w)
= \frac{\lambda_B P(w \mid \theta_B)}{\lambda_B P(w \mid \theta_B) + (1 - \lambda_B) \sum_{z} P(w \mid z)\, P(d \mid z)\, P(z)}
$$

$$
P(Z_{d,w} = z \mid d, w)
= \frac{P(w \mid z)\, P(d \mid z)\, P(z)}{\sum_{z'} P(w \mid z')\, P(d \mid z')\, P(z')}
$$
and the corresponding M-steps:

$$
P(w \mid z) = \frac{\sum_{d} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
                   {\sum_{w} \sum_{d} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
$$

$$
P(d \mid z) = \frac{\sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
                   {\sum_{d} \sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
$$

$$
P(z) = \frac{\sum_{d} \sum_{w} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
            {\sum_{d} \sum_{w} \sum_{z} n(d,w) \big( 1 - P(Z_{d,w} = \theta_B \mid d, w) \big) P(Z_{d,w} = z)}
$$